Data Analysis Final Project¶

Due Date: January 29, 2026, 23:59

Group: Cuircuit Synergy

Created By: Jeremia Baumgartner, Lorenz Buchinger, Tim Zwölfer


Table of contents

  1. A. Data Preprocessing and Data Quality (70 points)
  2. B. Visualization and Exploratory Analysis (55 points)
  3. C. Probability and Event Analysis (45 points)
  4. D. Statistical Theory Applications (45 points)
  5. E. Regression and Predictive Modeling (45 points)
  6. F. Dimensionality Reduction and Statistical Tests (40 points)

Initial Setup

In [1]:
# Initial setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure plotting
plt.rcParams.update({
    'figure.figsize': [6, 4],
    'figure.dpi': 150,
    'figure.autolayout': True,
    'axes.labelsize': 8,
    'axes.titlesize': 10,
    'font.size': 6,
    'xtick.labelsize': 6,
    'ytick.labelsize': 6
})

sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)

np.random.seed(42)

Load Data

In [2]:
df_csv_data = pd.read_csv('traffic_accidents.csv')
df_csv_data_raw = df_csv_data # Keep a raw copy

A. Data Preprocessing and Data Quality (70 points)¶


Assigned to Lorenz

  • Dataset overview (dimensions, columns, types, time range, sampling rate, missingness summary) (10 points)
  • Basic statistical analysis using pandas (descriptives, grouped stats, quantiles) (10 points)
  • Original data quality analysis with visualization (missingness patterns, outliers, dupli- cates, timestamp gaps, inconsistent units) (20 points)
  • Data preprocessing pipeline (cleaning steps, handling missing data, outliers strategy, re- sampling or alignment if needed, feature engineering basics) (20 points)
  • Preprocessed vs original comparison (before/after visuals plus commentary on what changed and why) (10 points)

Dataset Overview¶

Dimensions

In [3]:
print("The dataset contains", df_csv_data.shape[0], "rows and", df_csv_data.shape[1], "columns.")
The dataset contains 209306 rows and 24 columns.

Columns and types

In [4]:
schema_table = (
    df_csv_data.dtypes
    .reset_index()
    .rename(columns={"index": "column", 0: "dtype"})
)

schema_table
Out[4]:
column dtype
0 crash_date str
1 traffic_control_device str
2 weather_condition str
3 lighting_condition str
4 first_crash_type str
5 trafficway_type str
6 alignment str
7 roadway_surface_cond str
8 road_defect str
9 crash_type str
10 intersection_related_i str
11 damage str
12 prim_contributory_cause str
13 num_units int64
14 most_severe_injury str
15 injuries_total float64
16 injuries_fatal float64
17 injuries_incapacitating float64
18 injuries_non_incapacitating float64
19 injuries_reported_not_evident float64
20 injuries_no_indication float64
21 crash_hour int64
22 crash_day_of_week int64
23 crash_month int64

Time Range

In [5]:
first_date = df_csv_data["crash_date"].min()
last_date = df_csv_data["crash_date"].max()

print("First entry date:", first_date)
print("Last entry date:", last_date)
First entry date: 01/01/2016 01:03:00 AM
Last entry date: 12/31/2024 12:55:00 PM

Sampling Rate
There is no constant sampling rate since car accidents don't happen in fixed intervals.

Missingness Summary
Missing values in the dataset are encoded using the string 'UNKNOWN' rather than null values.

In [6]:
num_values_total = df_csv_data_raw.shape[0] * df_csv_data_raw.shape[1]
print("Total number of values in the dataset:", num_values_total)

unknown_total = (df_csv_data_raw == "UNKNOWN").sum().sum()
print("Total UNKNOWN values:", unknown_total)

unknown_percentage = (unknown_total / num_values_total) * 100
print(f"Percentage of UNKNOWN values: {unknown_percentage:.2f}%")
Total number of values in the dataset: 5023344
Total UNKNOWN values: 63320
Percentage of UNKNOWN values: 1.26%
In [7]:
missing_summary = pd.DataFrame({
    "missing_count": (df_csv_data_raw == "UNKNOWN").sum(),
    "missing_percent": ((df_csv_data_raw == "UNKNOWN").mean() * 100).round(2)
}).sort_values(by="missing_percent", ascending=False)

missing_summary
Out[7]:
missing_count missing_percent
road_defect 34426 16.45
roadway_surface_cond 12509 5.98
weather_condition 6534 3.12
traffic_control_device 4455 2.13
lighting_condition 4336 2.07
trafficway_type 1060 0.51
crash_date 0 0.00
injuries_total 0 0.00
crash_day_of_week 0 0.00
crash_hour 0 0.00
injuries_no_indication 0 0.00
injuries_reported_not_evident 0 0.00
injuries_non_incapacitating 0 0.00
injuries_incapacitating 0 0.00
injuries_fatal 0 0.00
prim_contributory_cause 0 0.00
most_severe_injury 0 0.00
num_units 0 0.00
damage 0 0.00
intersection_related_i 0 0.00
crash_type 0 0.00
alignment 0 0.00
first_crash_type 0 0.00
crash_month 0 0.00

Basic Statistical Analysis Using Pandas (descriptives, grouped stats, quantiles)¶

Descriptive Analysis
Qantiles, min., max., mean value and standard deviation of all columns.

In [8]:
df_csv_data = df_csv_data.replace("UNKNOWN", np.nan)
df_csv_data.describe()
Out[8]:
num_units injuries_total injuries_fatal injuries_incapacitating injuries_non_incapacitating injuries_reported_not_evident injuries_no_indication crash_hour crash_day_of_week crash_month
count 209306.000000 209306.000000 209306.000000 209306.000000 209306.000000 209306.000000 209306.000000 209306.000000 209306.000000 209306.000000
mean 2.063300 0.382717 0.001859 0.038102 0.221241 0.121516 2.244002 13.373047 4.144024 6.771822
std 0.396012 0.799720 0.047502 0.233964 0.614960 0.450865 1.241175 5.603830 1.966864 3.427593
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000
25% 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 9.000000 2.000000 4.000000
50% 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.000000 14.000000 4.000000 7.000000
75% 2.000000 1.000000 0.000000 0.000000 0.000000 0.000000 3.000000 17.000000 6.000000 10.000000
max 11.000000 21.000000 3.000000 7.000000 21.000000 15.000000 49.000000 23.000000 7.000000 12.000000
In [9]:
quantiles = df_csv_data.quantile(
    q=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99],
    numeric_only=True
)
quantiles
Out[9]:
num_units injuries_total injuries_fatal injuries_incapacitating injuries_non_incapacitating injuries_reported_not_evident injuries_no_indication crash_hour crash_day_of_week crash_month
0.01 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
0.05 2.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 1.0 1.0
0.25 2.0 0.0 0.0 0.0 0.0 0.0 2.0 9.0 2.0 4.0
0.50 2.0 0.0 0.0 0.0 0.0 0.0 2.0 14.0 4.0 7.0
0.75 2.0 1.0 0.0 0.0 0.0 0.0 3.0 17.0 6.0 10.0
0.95 3.0 2.0 0.0 0.0 1.0 1.0 4.0 22.0 7.0 12.0
0.99 4.0 4.0 0.0 1.0 3.0 2.0 6.0 23.0 7.0 12.0

Grouped Stats

In [10]:
# Accidents per weekday
# Count number of accidents per weekday
weekday_counts = df_csv_data['crash_day_of_week'].value_counts().sort_index()

# Plot
# Map numeric days to short names (assumes 1=Sunday, 2=Monday, ..., 7=Saturday)
day_map = {1: "Sun", 2: "Mon", 3: "Tue", 4: "Wed", 5: "Thu", 6: "Fri", 7: "Sat"}
wc = weekday_counts.reset_index()
wc.columns = ["day_num", "count"]
wc["day_name"] = wc["day_num"].map(day_map)

# Order Monday -> Sunday
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
wc["day_name"] = pd.Categorical(wc["day_name"], categories=order, ordered=True)
wc = wc.sort_values("day_name")

plt.figure(figsize=(8, 4))
ax = sns.barplot(data=wc, x="day_name", y="count", palette="viridis")
ax.set_xlabel("Day of Week")
ax.set_ylabel("Number of Accidents")
ax.set_title("Accidents per Weekday", fontsize=12)
ax.grid(axis="y", linestyle="--", alpha=0.6)

plt.tight_layout()
plt.show()
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\3204367320.py:18: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.barplot(data=wc, x="day_name", y="count", palette="viridis")
No description has been provided for this image
In [11]:
# Nicely styled barplot for total injuries per weekday
weekday_injuries = df_csv_data.groupby("crash_day_of_week")["injuries_total"].sum().sort_index()
day_map = {1: "Sun", 2: "Mon", 3: "Tue", 4: "Wed", 5: "Thu", 6: "Fri", 7: "Sat"}
wc = weekday_injuries.reset_index()
wc.columns = ["day_num", "injuries"]
wc["day_name"] = wc["day_num"].map(day_map)

# Order Monday -> Sunday
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
wc["day_name"] = pd.Categorical(wc["day_name"], categories=order, ordered=True)
wc = wc.sort_values("day_name")

plt.figure(figsize=(8, 4))
ax = sns.barplot(data=wc, x="day_name", y="injuries", palette="viridis")
ax.set_xlabel("Day of Week")
ax.set_ylabel("Total Number of Injuries")
ax.set_title("Total Number of Injuries per Weekday", fontsize=12)
ax.grid(axis="y", linestyle="--", alpha=0.6)

plt.tight_layout()
plt.show()
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\1750550436.py:14: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.barplot(data=wc, x="day_name", y="injuries", palette="viridis")
No description has been provided for this image
In [12]:
# Nicely styled weather condition distribution (top categories + "Other")
vc = df_csv_data["weather_condition"].fillna("UNKNOWN").value_counts()
top_n = 10
if len(vc) > top_n:
    top = vc.head(top_n).copy()
    top["Other"] = vc.iloc[top_n:].sum()
else:
    top = vc.copy()

df_plot = top.reset_index()
df_plot.columns = ["weather", "count"]
df_plot["pct"] = df_plot["count"] / df_plot["count"].sum() * 100

plt.figure(figsize=(10, 5))
ax = sns.barplot(data=df_plot, x="weather", y="count", palette="mako")
ax.set_title("Weather Condition Distribution", fontsize=12)
ax.set_xlabel("Weather Condition")
ax.set_ylabel("Number of Accidents")
ax.tick_params(axis="x", rotation=45, labelsize=8)
ax.tick_params(axis="y", labelsize=8)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, pos: f"{int(x):,}"))

# Annotate bars with count and percentage
max_count = df_plot["count"].max()
for p, cnt, pct in zip(ax.patches, df_plot["count"], df_plot["pct"]):
    ax.text(p.get_x() + p.get_width() / 2, p.get_height() + max_count * 0.01,
            f"{int(cnt):,}\n({pct:.1f}%)", ha="center", va="bottom", fontsize=8)

plt.grid(axis="y", linestyle="--", alpha=0.6)
sns.despine(trim=True)
plt.tight_layout()
plt.show()
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\1009555762.py:15: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.barplot(data=df_plot, x="weather", y="count", palette="mako")
No description has been provided for this image
In [13]:
# Analysis of correlation between injuries and weather condition
df_csv_data.groupby("weather_condition")["injuries_total"].agg(
    ["count", "mean", "median", "std"]
)
Out[13]:
count mean median std
weather_condition
BLOWING SAND, SOIL, DIRT 1 0.000000 0.0 NaN
BLOWING SNOW 127 0.314961 0.0 0.752781
CLEAR 164700 0.390401 0.0 0.812737
CLOUDY/OVERCAST 7533 0.369308 0.0 0.745190
FOG/SMOKE/HAZE 360 0.444444 0.0 0.829285
FREEZING RAIN/DRIZZLE 510 0.460784 0.0 0.864851
OTHER 627 0.483254 0.0 0.821607
RAIN 21703 0.411510 0.0 0.808552
SEVERE CROSS WIND GATE 32 0.250000 0.0 0.762001
SLEET/HAIL 308 0.461039 0.0 0.851388
SNOW 6871 0.308543 0.0 0.698708
In [14]:
# Count crashes per hour and plot
hour_counts = df_csv_data["crash_hour"].dropna().astype(int).value_counts().sort_index()

# Ensure hours 0-23 are present
hours = pd.Series(0, index=range(24))
hour_counts = hours.add(hour_counts, fill_value=0).astype(int)

df_hour = hour_counts.reset_index()
df_hour.columns = ["hour", "count"]

plt.figure(figsize=(12, 4))
ax = sns.barplot(data=df_hour, x="hour", y="count", palette="viridis")
ax.set_xlabel("Crash Hour (0-23)")
ax.set_ylabel("Number of Crashes")
ax.set_title("Count of Crashes by Hour of Day")
ax.set_xticks(range(24))
ax.set_xticklabels([str(h) for h in range(24)])
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, pos: f"{int(x):,}"))

# Annotate bars
max_count = df_hour["count"].max()
for p, cnt in zip(ax.patches, df_hour["count"]):
    ax.text(p.get_x() + p.get_width() / 2, p.get_height() + max_count * 0.01,
            f"{int(cnt):,}", ha="center", va="bottom", fontsize=8)

plt.tight_layout()
plt.show()
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\3252361176.py:12: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.barplot(data=df_hour, x="hour", y="count", palette="viridis")
No description has been provided for this image

Original data quality analysis with visualization¶

Missingness Patterns

In [15]:
missing_counts = df_csv_data.isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]
Out[15]:
road_defect               34426
roadway_surface_cond      12509
weather_condition          6534
traffic_control_device     4455
lighting_condition         4336
trafficway_type            1060
dtype: int64
In [16]:
cols = [
    "road_defect",
    "roadway_surface_cond",
    "weather_condition",
    "traffic_control_device",
    "lighting_condition",
    "trafficway_type"
]

# Ensure datetime
df = df_csv_data.copy()
df["crash_date"] = pd.to_datetime(df["crash_date"], format='%m/%d/%Y %I:%M:%S %p')

# Keep only relevant columns
df = df[["crash_date"] + cols]

# Create monthly period
df["month"] = df["crash_date"].dt.to_period("M")

# Calculate fraction missing per month per column
monthly_missing = (
    df.groupby("month")[cols]
      .apply(lambda x: x.isna().mean())
)

# Convert PeriodIndex to timestamp for plotting
monthly_missing.index = monthly_missing.index.to_timestamp()

# Plot heatmap
plt.figure(figsize=(16, 6))
ax = sns.heatmap(
    monthly_missing.T,
    cmap="rocket_r",
    cbar_kws={"label": "Fraction Missing"},
    linewidths=0.3
)

# Year-only x-axis ticks
months = monthly_missing.index
year_ticks = [i for i, d in enumerate(months) if d.month == 1]
year_labels = [d.year for d in months if d.month == 1]

ax.set_xticks(year_ticks)
ax.set_xticklabels(year_labels, rotation=0)

plt.title("Monthly Missing Data Fraction by Column")
plt.xlabel("Year")
plt.ylabel("Column")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [17]:
# Prepare data
col_names = [
    "injuries_non_incapacitating",
    "injuries_incapacitating",
    "injuries_reported_not_evident",
    "injuries_no_indication",
    "injuries_fatal",
    "injuries_total"
]

selected = df_csv_data[col_names]

# Zero fraction (ignoring NaNs)
non_na_counts = selected.notna().sum()
zero_frac = (selected == 0).sum() / non_na_counts

# Melt non-zero values and add log1p transform
df_melted = selected.melt(var_name="Variable", value_name="Value")
df_nonzero = df_melted[df_melted["Value"] > 0].copy()
df_nonzero["log1p"] = np.log1p(df_nonzero["Value"])

pretty_labels = [
    "Non-incapacitating",
    "Incapacitating",
    "Reported (not evident)",
    "No indication",
    "Fatal",
    "Total"
]

# Plot 1: Zero fraction
plt.figure(figsize=(10, 5))
ax = sns.barplot(x=zero_frac.index, y=zero_frac.values * 100, palette="muted")
ax.set_ylabel("Percentage of zeros (%)")
ax.set_xlabel("")
ax.set_title("Zero fraction per variable")
ax.set_xticklabels(pretty_labels, rotation=45, ha="right")
plt.tight_layout()
plt.show()

# Plot 2: Boxplot on log1p of non-zero values
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df_nonzero, x="Variable", y="log1p",
                 showcaps=True, showfliers=True, palette="vlag", order=col_names)
ax.set_ylabel("log1p(Value)")
ax.set_title("Boxplot of non-zero values (log1p)")
ax.set_xticklabels(pretty_labels, rotation=45, ha="right")
plt.tight_layout()
plt.show()
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\1360554475.py:33: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.barplot(x=zero_frac.index, y=zero_frac.values * 100, palette="muted")
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\1360554475.py:37: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(pretty_labels, rotation=45, ha="right")
No description has been provided for this image
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\1360554475.py:43: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax = sns.boxplot(data=df_nonzero, x="Variable", y="log1p",
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\1360554475.py:47: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_xticklabels(pretty_labels, rotation=45, ha="right")
No description has been provided for this image
In [18]:
# Check for typos, different capitalizations, or inconsistent entries in categorical columns

for col in df_csv_data.select_dtypes(include="object"):
    print(col)
    print(df_csv_data[col].value_counts().head())
    print("-"*40)

# No typos or inconsistencies found in categorical columns
C:\Users\Jere\AppData\Local\Temp\ipykernel_50520\1759486190.py:3: Pandas4Warning: For backward compatibility, 'str' dtypes are included by select_dtypes when 'object' dtype is specified. This behavior is deprecated and will be removed in a future version. Explicitly pass 'str' to `include` to select them, or to `exclude` to remove them and silence this warning.
See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  for col in df_csv_data.select_dtypes(include="object"):
crash_date
crash_date
12/29/2020 05:00:00 PM    10
02/17/2022 03:30:00 PM     8
06/17/2019 04:30:00 PM     6
11/26/2018 08:30:00 AM     6
09/27/2018 08:30:00 AM     6
Name: count, dtype: int64
----------------------------------------
traffic_control_device
traffic_control_device
TRAFFIC SIGNAL       123944
STOP SIGN/FLASHER     49139
NO CONTROLS           29508
OTHER                   670
YIELD                   468
Name: count, dtype: int64
----------------------------------------
weather_condition
weather_condition
CLEAR              164700
RAIN                21703
CLOUDY/OVERCAST      7533
SNOW                 6871
OTHER                 627
Name: count, dtype: int64
----------------------------------------
lighting_condition
lighting_condition
DAYLIGHT                  134109
DARKNESS, LIGHTED ROAD     53378
DARKNESS                    7436
DUSK                        6323
DAWN                        3724
Name: count, dtype: int64
----------------------------------------
first_crash_type
first_crash_type
TURNING                     64157
ANGLE                       52250
REAR END                    42018
SIDESWIPE SAME DIRECTION    20116
PEDESTRIAN                   8996
Name: count, dtype: int64
----------------------------------------
trafficway_type
trafficway_type
NOT DIVIDED                        77753
FOUR WAY                           49057
DIVIDED - W/MEDIAN (NOT RAISED)    34221
ONE-WAY                            12341
DIVIDED - W/MEDIAN BARRIER         10720
Name: count, dtype: int64
----------------------------------------
alignment
alignment
STRAIGHT AND LEVEL       204590
STRAIGHT ON GRADE          2992
CURVE, LEVEL               1014
STRAIGHT ON HILLCREST       478
CURVE ON GRADE              179
Name: count, dtype: int64
----------------------------------------
roadway_surface_cond
roadway_surface_cond
DRY              155905
WET               32908
SNOW OR SLUSH      6203
ICE                1303
OTHER               438
Name: count, dtype: int64
----------------------------------------
road_defect
road_defect
NO DEFECTS         171730
WORN SURFACE         1000
OTHER                 912
RUT, HOLES            741
SHOULDER DEFECT       358
Name: count, dtype: int64
----------------------------------------
crash_type
crash_type
NO INJURY / DRIVE AWAY              117376
INJURY AND / OR TOW DUE TO CRASH     91930
Name: count, dtype: int64
----------------------------------------
intersection_related_i
intersection_related_i
Y    199324
N      9982
Name: count, dtype: int64
----------------------------------------
damage
damage
OVER $1,500      147313
$501 - $1,500     41210
$500 OR LESS      20783
Name: count, dtype: int64
----------------------------------------
prim_contributory_cause
prim_contributory_cause
UNABLE TO DETERMINE              58316
FAILING TO YIELD RIGHT-OF-WAY    42914
FOLLOWING TOO CLOSELY            19084
DISREGARDING TRAFFIC SIGNALS     14591
IMPROPER TURNING/NO SIGNAL       12643
Name: count, dtype: int64
----------------------------------------
most_severe_injury
most_severe_injury
NO INDICATION OF INJURY     154789
NONINCAPACITATING INJURY     31527
REPORTED, NOT EVIDENT        16075
INCAPACITATING INJURY         6564
FATAL                          351
Name: count, dtype: int64
----------------------------------------

Analysis of Timestamp Gaps

In [19]:
# Ensure datetime and sort
df = df_csv_data.copy()
df["crash_date"] = pd.to_datetime(df["crash_date"], format='%m/%d/%Y %I:%M:%S %p')
df = df.sort_values("crash_date")

# Calculate gaps between consecutive accidents
df["time_gap"] = df["crash_date"].diff()

# Drop the first row (NaT gap)
gaps = df.dropna(subset=["time_gap"])

# Filter gaps greater than 1 day
long_gaps = (
    gaps[gaps["time_gap"] > pd.Timedelta(days=1)]
    [["crash_date", "time_gap"]]
    .rename(columns={"crash_date": "gap_end"})
)

# Add gap start for clarity
long_gaps["gap_start"] = long_gaps["gap_end"] - long_gaps["time_gap"]

# Optional: sort largest gaps first
long_gaps = long_gaps.sort_values("time_gap", ascending=False)

long_gaps
Out[19]:
gap_end time_gap gap_start
121162 2015-02-13 08:00:00 621 days 11:31:00 2013-06-01 20:29:00
132318 2015-05-25 23:38:00 101 days 15:38:00 2015-02-13 08:00:00
23119 2013-06-01 20:29:00 90 days 03:41:00 2013-03-03 16:48:00
106211 2015-08-02 19:55:00 68 days 20:17:00 2015-05-25 23:38:00
46652 2015-08-14 09:30:00 2 days 19:15:00 2015-08-11 14:15:00
42585 2015-08-06 10:00:00 1 days 19:30:00 2015-08-04 14:30:00
1958 2015-08-10 09:15:00 1 days 12:00:00 2015-08-08 21:15:00
15595 2015-08-17 01:11:00 1 days 10:11:00 2015-08-15 15:00:00
96371 2015-08-11 14:15:00 1 days 05:00:00 2015-08-10 09:15:00
In [20]:
# Ensure datetime
df["crash_date"] = pd.to_datetime(df["crash_date"], format='%m/%d/%Y %I:%M:%S %p')

# Weekly record counts
weekly_counts = (
    df.set_index("crash_date")
      .resample("W")
      .size()
)

# Compute overall gap span
gap_span_start = long_gaps["gap_start"].min()
gap_span_end = long_gaps["gap_end"].max()

# Plot full timeline
plt.figure(figsize=(14, 4))
plt.plot(weekly_counts.index, weekly_counts.values, label="Number of weekly records")
plt.axhline(0, linestyle="--", linewidth=1)

# Highlight overall gap timespan
plt.axvspan(
    gap_span_start,
    gap_span_end,
    alpha=0.25,
    label="Timespan with gaps >1 day"
)

plt.title("Record Count per Week with Record Gaps Highlighted")
plt.xlabel("Year")
plt.ylabel("Number of Records")
plt.legend()

plt.tight_layout()
plt.show()
No description has been provided for this image
In [21]:
# Ensure datetime
df_plot = df_csv_data.copy()
df_plot["crash_date"] = pd.to_datetime(df_plot["crash_date"], format='%m/%d/%Y %I:%M:%S %p')

# Loop through injury columns
for col in col_names:
    if col not in df_plot.columns:
        print(f"Skipping {col}: column not found.")
        continue

    # Resample weekly and sum values
    weekly_sum_col = (
        df_plot.set_index("crash_date")[col]
        .resample("W-MON")
        .sum(min_count=1)  # min_count=1 ensures empty weeks show NaN instead of 0
    )

    # Plot
    plt.figure(figsize=(14, 4))
    plt.plot(weekly_sum_col.index, weekly_sum_col.values, label=f"Weekly sum ({col})")
    plt.axhline(0, linestyle="--", linewidth=1)

    plt.title(f"Weekly Sum of {col}")
    plt.xlabel("Year")
    plt.ylabel("Sum of Injuries")
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Cleaning Data To clean the data the decision was made to remove all data entries before November of 2017. The reason for that is that there is way more data being sampled after that. The sampling frequency seems inconsistent before and rather consisten after November fo 2017.

In [22]:
# Make a copy and ensure datetime
df_csv_data_cleaned = df_csv_data.copy()
df_csv_data_cleaned["crash_date"] = pd.to_datetime(df_csv_data_cleaned["crash_date"], format='%m/%d/%Y %I:%M:%S %p')

# Remove entries before November 2017
df_csv_data_cleaned = df_csv_data_cleaned[df_csv_data_cleaned["crash_date"] >= "2017-11-01"]

Visualizing removed data.

In [23]:
# Ensure datetime
df_plot = df_csv_data.copy()
df_plot["crash_date"] = pd.to_datetime(df_plot["crash_date"], format='%m/%d/%Y %I:%M:%S %p')
df_cleaned = df_csv_data_cleaned.copy()
df_cleaned["crash_date"] = pd.to_datetime(df_cleaned["crash_date"], format='%m/%d/%Y %I:%M:%S %p')

# Split removed data
df_removed = df_plot[df_plot["crash_date"] < "2017-11-01"]

# Concatenate for consistent weekly resampling
df_combined = pd.concat([df_removed, df_cleaned])

# Loop through injury columns
for col in col_names:
    if col not in df_plot.columns:
        print(f"Skipping {col}: column not found.")
        continue

    # Resample weekly over the full timeline
    weekly_sum = df_combined.set_index("crash_date")[col].resample("W-MON").sum()

    # Separate removed vs cleaned for plotting
    weekly_removed = weekly_sum.loc[weekly_sum.index < "2017-11-01"]
    weekly_cleaned = weekly_sum.loc[weekly_sum.index >= "2017-11-01"]

    # Plot
    plt.figure(figsize=(14, 4))
    plt.plot(weekly_removed.index, weekly_removed.values, label=f"Removed data")
    plt.plot(weekly_cleaned.index, weekly_cleaned.values, label=f"Cleaned data")
    plt.axhline(0, linestyle="--", linewidth=1)

    plt.title(f"Weekly Sum of {col}")
    plt.xlabel("Year")
    plt.ylabel("Sum of Injuries")
    plt.legend()
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

B. Visualization and Exploratory Analysis (55 points)¶


Assigned to Jeremia

  • Time-series visualizations (raw, smoothed, rolling mean or windowed views) (10 points)
  • Distribution analysis with histograms and density style plots where applicable (10 points)
  • Correlation analysis and heatmaps (Pearson and at least one alternative such as Spearman, with short interpretation) (10 points)
  • Daily or periodic pattern analysis (day-of-week, hour-of-day, seasonality indicators, or test-cycle patterns) (15 points)
  • Summary of observed patterns as short check statements (similar to True/False style) with evidence (10 points)
In [24]:
df_b = df_csv_data_cleaned.copy() # copy original data
#print(df_b.head())
# Display a Time Series Plot of weekly accidents
df_b.set_index('crash_date', inplace=True)

Time-series visualization¶

In [25]:
weekly_accidents = df_b.resample('W').size()

plt.figure(figsize=(10, 6))
plt.plot(weekly_accidents.index, weekly_accidents.values, marker='')
plt.title('Weekly Accidents Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Accidents')
plt.grid(True)
plt.show()
No description has been provided for this image

Distribution analysis: histograms and density plots¶

We focus on smoothed distributions for key numeric fields and normalized comparisons for major categorical factors using df_b.

In [26]:
# Crash hour distribution by weather (independent density per weather, unshared y-axis for visibility)
hour_weather = df_b.dropna(subset=["crash_hour", "weather_condition"])

g = sns.displot(
    hour_weather,
    x="crash_hour",
    col="weather_condition",
    col_wrap=3,
    bins=24,
    stat="density",
    kde=True,
    color="teal",
    height=3,
    aspect=1.2,
    common_norm=False,          # normalize densities within each weather facet
    facet_kws={"sharey": False} # allow y-axis to scale per facet so curves are visible
)

# Set x-axis grid/ticks every 3 hours and update titles for all facets
for ax in g.axes.flatten():
    ax.set_xticks(np.arange(0, 24, 3))
    ax.grid(True, axis="x")
    # Update title to show only the weather condition value
    title = ax.get_title()
    if "=" in title:
        ax.set_title(title.split("=")[1].strip())

plt.suptitle("Crash hour distribution by weather condition", y=1.02)
plt.show()
No description has been provided for this image
In [27]:
# Top contributory causes (proportions) to highlight dominant categories
cause_counts = df_b["prim_contributory_cause"].value_counts(normalize=True).head(10)

sns.barplot(
    x=cause_counts.values,
    y=cause_counts.index,
    orient="h",
    color="slateblue"
)
plt.title("Top 10 primary contributory causes (share of crashes)")
plt.xlabel("Proportion of crashes")
plt.ylabel("Primary cause")
plt.xlim(0, cause_counts.values.max() * 1.1)
plt.show()
No description has been provided for this image

Histogram of total injuries by crash hour¶

Aggregated injuries per hour of day using df_b (weights capture total injuries, not just crash counts).

In [28]:
# Stacked barplot of injury types by crash hour
injury_cols = [
    "injuries_fatal",
    "injuries_incapacitating",
    "injuries_non_incapacitating",
    "injuries_reported_not_evident",
]

hour_inj = df_b.dropna(subset=["crash_hour"] + injury_cols)
hour_inj_sum = (
    hour_inj.groupby("crash_hour")[injury_cols]
    .sum()
    .reindex(range(24), fill_value=0)
)

bottom = np.zeros(24)
hours = hour_inj_sum.index
colors = ["#d73027", "#fc8d59", "#fee08b", "#91bfdb"]

plt.figure()
for col, color in zip(injury_cols, colors):
    plt.bar(
        hours,
        hour_inj_sum[col],
        bottom=bottom,
        label=col.replace("_", " "),
        color=color,
        edgecolor="white"
    )
    bottom += hour_inj_sum[col].values

plt.title("Injuries by crash hour (stacked)")
plt.xlabel("Crash hour")
plt.ylabel("Count of injuries")
plt.xticks(range(0, 24, 2))
plt.grid(True, axis="x")
plt.legend(title="Injury type", fontsize=6, title_fontsize=7)
plt.show()
No description has been provided for this image
In [29]:
# Normalize injury counts per hour to percentages and plot stacked bars (100% total per hour)
hour_inj_pct = hour_inj_sum.div(hour_inj_sum.sum(axis=1), axis=0).fillna(0) * 100

bottom_pct = np.zeros(len(hours))
plt.figure()
for col, color in zip(injury_cols, colors):
    plt.bar(
        hour_inj_pct.index,
        hour_inj_pct[col],
        bottom=bottom_pct,
        label=col.replace("_", " "),
        color=color,
        edgecolor="white"
    )
    bottom_pct += hour_inj_pct[col].values

plt.title("Injuries by crash hour (stacked, 100% scale)")
plt.xlabel("Crash hour")
plt.ylabel("Share of injuries (%)")
plt.ylim(0, 100)
plt.xticks(range(0, 24, 2))
plt.grid(True, axis="x")
plt.legend(title="Injury type", fontsize=6, title_fontsize=7)
plt.show()
No description has been provided for this image

Correlation analysis (Pearson & Spearman)¶

We focus on numeric crash/injury metrics where correlation is meaningful; both Pearson (linear) and Spearman (rank/monotonic) are shown.

In [30]:
# Pearson correlation heatmap (linear relationships)
num_cols = [
    "injuries_fatal",
    "injuries_incapacitating",
    "injuries_non_incapacitating",
    "injuries_reported_not_evident",
    "num_units",
    "crash_hour",
    "crash_day_of_week",
    "crash_month",
]

corr_df = df_b.reset_index()[num_cols].dropna()
corr_p = corr_df.corr(method="pearson")

plt.figure(figsize=(6, 5))
sns.heatmap(
    corr_p,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    square=True,
    cbar=False,
    annot_kws={"size": 4}
)
plt.title("Pearson correlation (linear)")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [31]:
# Spearman correlation heatmap (rank/monotonic relationships)
corr_s = corr_df.corr(method="spearman")

plt.figure(figsize=(6, 5))
sns.heatmap(
    corr_s,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1,
    vmax=1,
    square=True,
    cbar=False,
    annot_kws={"size": 4}
)
plt.title("Spearman correlation (rank)")
plt.tight_layout()
plt.show()
No description has been provided for this image

Seasonality by week of year¶

Mean crashes per ISO week (across years) with variability to highlight weekly seasonality.

In [32]:
# Weekly seasonality across years (mean crashes per ISO week with variability)
# Ensure datetime index
if not np.issubdtype(df_b.index.dtype, np.datetime64):
    df_b = df_b.copy()
    df_b.index = pd.to_datetime(df_b.index)

weekly_counts = df_b.resample("W").size()
iso_week = weekly_counts.index.isocalendar().week
weekly_stats = weekly_counts.groupby(iso_week).agg(["mean", "std"])

weeks = sorted(weekly_stats.index)
mean_vals_w = weekly_stats.loc[weeks, "mean"]
std_vals_w = weekly_stats.loc[weeks, "std"]

plt.figure(figsize=(10, 4))
plt.bar(weeks, mean_vals_w, yerr=std_vals_w, color="seagreen", edgecolor="white", alpha=0.9, capsize=2)
plt.xticks(range(0, 53, 4))
plt.title("Average crashes per ISO week (multi-year)")
plt.xlabel("ISO week number")
plt.ylabel("Mean crashes per week")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
 

Seasonality by month¶

Crashes aggregated across multiple years to reveal month-of-year patterns (mean monthly crashes with variability).

In [33]:
# Monthly seasonality across years (mean crashes per month with variability)
import calendar

# Ensure datetime index
if not np.issubdtype(df_b.index.dtype, np.datetime64):
    df_b = df_b.copy()
    df_b.index = pd.to_datetime(df_b.index)

monthly_counts = df_b.resample("ME").size()
monthly_stats = monthly_counts.groupby(monthly_counts.index.month).agg(["mean", "std"])

months = range(1, 13)
mean_vals = monthly_stats["mean"].reindex(months)
std_vals = monthly_stats["std"].reindex(months)

plt.figure(figsize=(8, 4))
plt.bar(months, mean_vals, yerr=std_vals, color="steelblue", edgecolor="white", alpha=0.9, capsize=3)
plt.xticks(months, [calendar.month_abbr[m] for m in months], rotation=0)
plt.title("Average crashes per month (multi-year)")
plt.xlabel("Month")
plt.ylabel("Mean crashes per month")
plt.tight_layout()
plt.show()
No description has been provided for this image

Summary¶

  • Crash hour distribution differs by weather condition.
  • The top cause is "failing to yield right-of-way"
  • The Rushhours have peaks in crashes. At 7-8 h and 15-17 h
  • Fatal and incapacitating injuries are relatively more in night time.
  • The numerical columns don't correlate.
  • In the Chrisms holydays the number of accidents is reduced.

C. Probability and Event Analysis (45 points)¶


Assigned to Tim

  • Threshold-based probability estimation for events (define event, justify threshold, compute empirical probability) (15 points)
  • Cross tabulation analysis for two variables (10 points)
  • Conditional probability analysis (at least two meaningful conditional relationships) (15 points)
  • Summary of observations and limitations (what could bias these estimates, what assump- tions were made) (5 points)

Threshold-based probability estimation¶

In [34]:
# Important Notes:
# purely Data-Driven
# Depends on Size of data, Observation period Data quality

# Event 1: Late hour Crash (between 20:00 and 5:00)
# Relevance: Poor visibility, fatigue, alcohol...
# Thresholds: typical night hours. 
late_hour_event = (
    (df_csv_data_cleaned["crash_hour"] >= 20) |
    (df_csv_data_cleaned["crash_hour"] <= 5)
)

num_late_hour_crashes = late_hour_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_late_hour = num_late_hour_crashes / total_crashes
print(f"Empirical Probability Event 1 (Late Hour Crash): {empirical_probability_late_hour:.3%}")

# Event 2: Rush-hour crash (7:00-9:00 and 16:00-18:00)
# Relevance: High traffic density, frequent stop-and-go driving, time pressure and stress
# Thresholds: standard commuting periods
rush_hour_event = (
    ((df_csv_data_cleaned["crash_hour"] >= 7) & (df_csv_data_cleaned["crash_hour"] <= 9)) |
    ((df_csv_data_cleaned["crash_hour"] >= 16) & (df_csv_data_cleaned["crash_hour"] <= 18))
)

num_rush_hour_crashes = rush_hour_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_rush_hour = num_rush_hour_crashes / total_crashes
print(f"Empirical Probability Event 2 (Rush Hour Crash): {empirical_probability_rush_hour:.3%}")

# Event 3: Severe crash
# Relevance: fatalities, or life-altering injuries requiring intensive medical care
# Trhesholds: greater than one injury, focus on severity rather than frequency, avoids dilution by
# minor or non-injury crashes
severe_crash_event = (
    (df_csv_data_cleaned["injuries_fatal"] >= 1) |
    (df_csv_data_cleaned["injuries_incapacitating"] >= 1)
)

num_severe_crashes = severe_crash_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_severe = num_severe_crashes / total_crashes

print(
    f"Empirical Probability Event 3 (Severe Crash): "
    f"{empirical_probability_severe:.3%}"
)

# Event 4: Non Severe crash
# Relevance: Most of the Accidents are non severe crashes with no major injuries
non_severe_crash_event = (
    (df_csv_data_cleaned["injuries_fatal"] == 0) &
    (df_csv_data_cleaned["injuries_incapacitating"] == 0)
)

num_non_severe_crashes = non_severe_crash_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_non_severe = num_non_severe_crashes / total_crashes

print(
    f"Empirical Probability Event 4 (Non-Severe Crash): "
    f"{empirical_probability_non_severe:.3%}"
)

# Since this is the complement of the severe crash event, the added Probability should be 1
print(
    f"Empirical Probability Non-Severe Crash + Severe Crash: "
    f"{empirical_probability_non_severe + empirical_probability_severe:.3%}"
)

# Event 5: Rainy weather crash
# Relevance: Rain and freezing rain: reduce visability, decrease tire-road friction, increase
# braking distances
# Threshold: Freezing rain is grouped with rain, since both involve liquid precipation effecting the
# surface friction
rainy_weather_event = df_csv_data_cleaned["weather_condition"].isin([
    "RAIN",
    "FREEZING RAIN/DRIZZLE"
])

num_rainy_weather_crashes = rainy_weather_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_rainy_weather = num_rainy_weather_crashes / total_crashes

print(
    f"Empirical Probability Event 5 (Rainy Weather Crash): "
    f"{empirical_probability_rainy_weather:.3%}"
)

# Event 6: Snowy weather crash
# Relevance: Severly reduced friction, impair vehicle control, may obscure lane markings and ohter
# vehicles
# Threshold: Snow-related categories are grouped due to similar driving hazards
snowy_weather_event = df_csv_data_cleaned["weather_condition"].isin([
    "SNOW",
    "BLOWING SNOW"
])

num_snowy_weather_crashes = snowy_weather_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_snowy_weather = num_snowy_weather_crashes / total_crashes

print(
    f"Empirical Probability Event 6 (Snowy Weather Crash): "
    f"{empirical_probability_snowy_weather:.3%}"
)


# Event 7: Poor visibility weather crash
# Relevance: these conditions reduce a driver's ability to: -detect hazards, -judge distance and
# speed, -react in time
# Threshold: Only conditions with direct visibility impairment are included, Categories are chosen
# conservatively to avoid overgeneralization
poor_visibility_event = df_csv_data_cleaned["weather_condition"].isin([
    "FOG/SMOKE/HAZE",
    "SLEET/HAIL",
    "BLOWING SNOW",
    "BLOWING SAND, SOIL, DIRT"
])

num_poor_visibility_crashes = poor_visibility_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_poor_visibility = num_poor_visibility_crashes / total_crashes

print(
    f"Empirical Probability Event 7 (Poor Visibility Weather Crash): "
    f"{empirical_probability_poor_visibility:.3%}"
)

# Event 8: Clear
# Relevance: Baseline driving environment, minimal weather-related visual impairment.
# Threshold: Cloudy/overcast conditions usually do not reduce visibility significantly, provides a
# meaningful contrast to poor visibility events
clear_weather_event = df_csv_data_cleaned["weather_condition"].isin([
    "CLEAR",
    "CLOUDY/OVERCAST"
])

num_clear_weather_crashes = clear_weather_event.sum()
total_crashes = len(df_csv_data_cleaned)

empirical_probability_clear_weather = num_clear_weather_crashes / total_crashes

print(
    f"Empirical Probability Event 8 (Clear Weather Crash): "
    f"{empirical_probability_clear_weather:.3%}"
)
Empirical Probability Event 1 (Late Hour Crash): 22.856%
Empirical Probability Event 2 (Rush Hour Crash): 35.926%
Empirical Probability Event 3 (Severe Crash): 3.397%
Empirical Probability Event 4 (Non-Severe Crash): 96.603%
Empirical Probability Non-Severe Crash + Severe Crash: 100.000%
Empirical Probability Event 5 (Rainy Weather Crash): 10.532%
Empirical Probability Event 6 (Snowy Weather Crash): 3.552%
Empirical Probability Event 7 (Poor Visibility Weather Crash): 0.394%
Empirical Probability Event 8 (Clear Weather Crash): 82.141%
In [35]:
# Visualization

# Bar Plot
event_probabilities = pd.DataFrame({
    "Event": [
        "Late Hour (20–5)",
        "Rush Hour (7–9, 16–18)",
        "Severe Crash",
        "Non-Severe Crash",
        "Rainy Weather",
        "Snowy Weather",
        "Poor Visibility",
        "Clear Weather"
    ],
    "Probability": [
        empirical_probability_late_hour,
        empirical_probability_rush_hour,
        empirical_probability_severe,
        empirical_probability_non_severe,
        empirical_probability_rainy_weather,
        empirical_probability_snowy_weather,
        empirical_probability_poor_visibility,
        empirical_probability_clear_weather
    ]
})

plt.figure(figsize=(12, 6))
plt.bar(event_probabilities["Event"], event_probabilities["Probability"])
plt.ylabel("Empirical Probability")
plt.title("Empirical Probabilities of Defined Crash Events")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
No description has been provided for this image

Cross tabulation analysis¶

In [36]:
# Function to display Cross tabulation
def plot_crosstab_table(crosstab, title, xlabel, ylabel):
    # Absolute values
    counts = crosstab.values

    # Row-wise proportions
    proportions = crosstab.div(crosstab.sum(axis=1), axis=0).values

    n_rows, n_cols = counts.shape

    plt.figure()
    plt.imshow(np.ones_like(counts), vmin=0, vmax=1)  # neutral background
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    # Axis ticks
    plt.xticks(range(n_cols), crosstab.columns.astype(str))
    plt.yticks(range(n_rows), crosstab.index.astype(str))

    # Grid lines
    plt.grid(False)
    plt.gca().set_xticks(np.arange(-.5, n_cols, 1), minor=True)
    plt.gca().set_yticks(np.arange(-.5, n_rows, 1), minor=True)
    plt.grid(which="minor")

    # Cell annotations
    for i in range(n_rows):
        for j in range(n_cols):
            plt.text(
                j, i,
                f"{counts[i, j]}\n({proportions[i, j]*100:.1f}%)",
                ha="center",
                va="center"
            )

    plt.tight_layout()
    plt.show()

# Cross Tabulation 1: Late-Hour Crash vs Severe Crash
# Relevance: Night driving is associated with: -fatigue, -reduced visibility, -alcohol.
# What we analyze: Are severe crashes over-represented during late hours?
crosstab_late_severe = pd.crosstab(
    late_hour_event,
    severe_crash_event,
    rownames=["Late Hour Crash"],
    colnames=["Severe Crash"]
)

print("Cross Tabulation: Late-Hour Crash vs Severe Crash")
print(crosstab_late_severe)
plot_crosstab_table(
    crosstab_late_severe,
    title="Late-Hour Crash vs Severe Crash",
    xlabel="Severe Crash",
    ylabel="Late-Hour Crash"
)

# Cross Tabulation 2: Rush-Hour Crash vs Non-Severe Crash
# Relevance: Rush hours involve: -high traffic density, -lower speeds, -frequent stop and go
# situations
# Hypothesis: more crashes, but often less severe
# What we analyze: What is the proportion of non-severe crashes higher during rush hours?
crosstab_rush_non_severe = pd.crosstab(
    rush_hour_event,
    non_severe_crash_event,
    rownames=["Rush Hour Crash"],
    colnames=["Non-Severe Crash"]
)

print("\nCross Tabulation: Rush-Hour Crash vs Non-Severe Crash")
print(crosstab_rush_non_severe)
plot_crosstab_table(
    crosstab_rush_non_severe,
    title="Rush-Hour Crash vs Non-Severe Crash",
    xlabel="Non-Severe Crash",
    ylabel="Rush-Hour Crash"
)

# Cross Tabulation 3: Poor Visibility vs Severe Crash
# Relevance: Visability directly affects reaction time. Well established relationship in traffic
# safty research
# What we analyze: Do severe crashes occur more frequently under poor visibility conditions?
crosstab_visibility_severe = pd.crosstab(
    poor_visibility_event,
    severe_crash_event,
    rownames=["Poor Visibility"],
    colnames=["Severe Crash"]
)

print("\nCross Tabulation: Poor Visibility vs Severe Crash")
print(crosstab_visibility_severe)
plot_crosstab_table(
    crosstab_visibility_severe,
    title="Poor Visibility vs Severe Crash",
    xlabel="Severe Crash",
    ylabel="Poor Visibility"
)

# Cross Tabulation 4: Clear Weather vs Non-Severe Crash
# Relevance: Clear conditions serve as a baseline
# What we analyze: Are crashes under clear weather more likely to be non-severe?
crosstab_clear_non_severe = pd.crosstab(
    clear_weather_event,
    non_severe_crash_event,
    rownames=["Clear Weather"],
    colnames=["Non-Severe Crash"]
)
print("\nCross Tabulation: Clear Weather vs Non-Severe Crash")
print(crosstab_clear_non_severe)
plot_crosstab_table(
    crosstab_clear_non_severe,
    title="Clear Weather vs Non-Severe Crash",
    xlabel="Non-Severe Crash",
    ylabel="Clear Weather"
)
Cross Tabulation: Late-Hour Crash vs Severe Crash
Severe Crash      False  True 
Late Hour Crash               
False            139608   4344
True              40655   1994
No description has been provided for this image
Cross Tabulation: Rush-Hour Crash vs Non-Severe Crash
Non-Severe Crash  False   True 
Rush Hour Crash                
False              4407  115155
True               1931   65108
No description has been provided for this image
Cross Tabulation: Poor Visibility vs Severe Crash
Severe Crash      False  True 
Poor Visibility               
False            179546   6319
True                717     19
No description has been provided for this image
Cross Tabulation: Clear Weather vs Non-Severe Crash
Non-Severe Crash  False   True 
Clear Weather                  
False               931   32394
True               5407  147869
No description has been provided for this image

Conditional Probability¶

In [37]:
# Conditional Probability 1:
# Probability of a severe crash given a late-hour crash
# Relevance: Late hours are associated with fatigue, alcohol, and reduced visibility,
# which may increase crash severity.

# Marginal probabilities
P_severe = severe_crash_event.sum() / len(df_csv_data_cleaned)
P_late_hour = late_hour_event.sum() / len(df_csv_data_cleaned)

# Conditional probabilities
P_severe_given_late = (
    severe_crash_event & late_hour_event
).sum() / late_hour_event.sum()

P_late_given_severe = (
    severe_crash_event & late_hour_event
).sum() / severe_crash_event.sum()

print(f"P(Severe Crash): {P_severe:.3%}")
print(f"P(Late-Hour Crash): {P_late_hour:.3%}")
print(f"P(Severe Crash | Late-Hour Crash): {P_severe_given_late:.3%}")
print(f"P(Late-Hour Crash | Severe Crash): {P_late_given_severe:.3%}")


# Conditional Probability 2:
# Probability of a severe crash given poor visibility conditions
# Relevance: Poor visibility impairs hazard perception and reaction time,
# potentially leading to higher-impact collisions.

# Marginal probabilities
P_poor_visibility = poor_visibility_event.sum() / len(df_csv_data_cleaned)

# Conditional probabilities
P_severe_given_poor_visibility = (
    severe_crash_event & poor_visibility_event
).sum() / poor_visibility_event.sum()

P_poor_visibility_given_severe = (
    severe_crash_event & poor_visibility_event
).sum() / severe_crash_event.sum()

print(f"P(Severe Crash): {P_severe:.3%}")
print(f"P(Poor Visibility Crash): {P_poor_visibility:.3%}")
print(f"P(Severe Crash | Poor Visibility): {P_severe_given_poor_visibility:.3%}")
print(f"P(Poor Visibility Crash | Severe Crash): {P_poor_visibility_given_severe:.3%}")



# Conditional Probability 3:
# Probability of a non-severe crash given a rush-hour crash
# Relevance: Rush-hour traffic typically involves lower speeds due to congestion,
# which may reduce injury severity despite higher crash frequency.

# Marginal probabilities
P_non_severe = non_severe_crash_event.sum() / len(df_csv_data_cleaned)
P_rush_hour = rush_hour_event.sum() / len(df_csv_data_cleaned)

# Conditional probabilities
P_non_severe_given_rush = (
    non_severe_crash_event & rush_hour_event
).sum() / rush_hour_event.sum()

P_rush_given_non_severe = (
    non_severe_crash_event & rush_hour_event
).sum() / non_severe_crash_event.sum()

print(f"P(Non-Severe Crash): {P_non_severe:.3%}")
print(f"P(Rush-Hour Crash): {P_rush_hour:.3%}")
print(f"P(Non-Severe Crash | Rush-Hour Crash): {P_non_severe_given_rush:.3%}")
print(f"P(Rush-Hour Crash | Non-Severe Crash): {P_rush_given_non_severe:.3%}")
P(Severe Crash): 3.397%
P(Late-Hour Crash): 22.856%
P(Severe Crash | Late-Hour Crash): 4.675%
P(Late-Hour Crash | Severe Crash): 31.461%
P(Severe Crash): 3.397%
P(Poor Visibility Crash): 0.394%
P(Severe Crash | Poor Visibility): 2.582%
P(Poor Visibility Crash | Severe Crash): 0.300%
P(Non-Severe Crash): 96.603%
P(Rush-Hour Crash): 35.926%
P(Non-Severe Crash | Rush-Hour Crash): 97.120%
P(Rush-Hour Crash | Non-Severe Crash): 36.118%

Summary of Observations¶

Observations:

  • Contrary to intuitive expectations, the conditional probabilities of severe crashes under
    late-hour conditions and poor visibility conditions are relatively low and of similar magnitude.
    P(Severe Crash∣Late Hour)≈4.7%,
    𝑃(Severe Crash∣Poor Visibility)≈2.6%,
    𝑃(Severe Crash∣Rush Hour)≈2.29%
  • Conditions that increase the likelihood of crashes do not necessarily increase the severity of outcomes.

Assumptions:

  • All crashes are treated as independent observations, even though multiple crashes may involve similar locations, times, or drivers.
  • Binary event definitions (e.g., severe vs non-severe) adequately capture the complexity of crash outcomes.
  • Weather and lighting conditions recorded at the time of the crash correctly represent the actual driving conditions experienced by drivers.
  • The chosen thresholds (e.g., late hours, rush hours, severity definitions) meaningfully separate different risk regimes.

Potential Bias and Limitations:

  • Fatal and incapacitating injury crashes are relatively rare.
  • This can lead to unstable probability estimates and large relative changes from small absolute differences.
  • Injury severity may be misclassified or updated after initial reporting.
  • Weather and visibility conditions are categorical and may not capture intensity (e.g., light vs heavy rain).
  • Time-based thresholds (late hours, rush hours) are somewhat arbitrary and based on common conventions.
  • Slight changes to these thresholds could alter probability estimates.
  • Variables such as speed, driver age, alcohol involvement, and road type are not controlled for.
  • Traffic volume is not directly observed, limiting interpretation of frequency-based probabilities.

D. Statistical Theory Applications (45 points)¶


Assigned to Tim

  • Law of Large Numbers demonstration (15 points)
  • Central Limit Theorem application (sampling distributions, effect of sample size, interpretation) (25 points)
  • Result interpretation and sanity checks (what would invalidate your conclusion, what you verified) (5 points)

Law of Large Numbers¶

In [38]:
# Convert severe crash event to indicator variable
# 1 = severe crash, 0 = non-severe crash
severe_indicator = severe_crash_event.astype(int).values.copy()

# Shuffle observations to simulate random order
np.random.seed(42)
np.random.shuffle(severe_indicator)

# Sample sizes
sample_sizes = np.arange(1, len(severe_indicator) + 1)

# Cumulative empirical probability
cumulative_mean = np.cumsum(severe_indicator) / sample_sizes

# Overall empirical probability (reference)
overall_probability = cumulative_mean[-1]

# Zoomed in Plot
max_samples = 100

plt.figure()
plt.plot(
    sample_sizes[:max_samples],
    cumulative_mean[:max_samples],
    label="Cumulative Empirical Probability"
)

plt.axhline(
    overall_probability,
    linestyle="--",
    label="Overall Empirical Probability"
)

plt.xlabel("Sample Size")
plt.ylabel("Empirical Probability of Severe Crash")
plt.title(
    "Law of Large Numbers Demonstration\n"
    "Empirical Probability (First 100)"
)
plt.legend()
plt.show()

max_samples = 1000

plt.figure()
plt.plot(
    sample_sizes[:max_samples],
    cumulative_mean[:max_samples],
    label="Cumulative Empirical Probability"
)

plt.axhline(
    overall_probability,
    linestyle="--",
    label="Overall Empirical Probability"
)

plt.xlabel("Sample Size")
plt.ylabel("Empirical Probability of Severe Crash")
plt.title(
    "Law of Large Numbers Demonstration\n"
    "Empirical Probability (First 1000)"
)
plt.legend()
plt.show()

max_samples = 5000

plt.figure()
plt.plot(
    sample_sizes[:max_samples],
    cumulative_mean[:max_samples],
    label="Cumulative Empirical Probability"
)

plt.axhline(
    overall_probability,
    linestyle="--",
    label="Overall Empirical Probability"
)

plt.xlabel("Sample Size")
plt.ylabel("Empirical Probability of Severe Crash")
plt.title(
    "Law of Large Numbers Demonstration\n"
    "Empirical Probability (First 5000)"
)
plt.legend()
plt.show()

max_samples = 10000

plt.figure()
plt.plot(
    sample_sizes[:max_samples],
    cumulative_mean[:max_samples],
    label="Cumulative Empirical Probability"
)

plt.axhline(
    overall_probability,
    linestyle="--",
    label="Overall Empirical Probability"
)

plt.xlabel("Sample Size")
plt.ylabel("Empirical Probability of Severe Crash")
plt.title(
    "Law of Large Numbers Demonstration\n"
    "Empirical Probability (First 10000)"
)
plt.legend()
plt.show()

# Plot
plt.figure()
plt.plot(sample_sizes, cumulative_mean, label="Cumulative Empirical Probability")
plt.axhline(
    overall_probability,
    linestyle="--",
    label="Overall Empirical Probability"
)
plt.xlabel("Sample Size")
plt.ylabel("Empirical Probability of Severe Crash")
plt.title("Law of Large Numbers Demonstration\nConvergence of Empirical Probability (Full Data Set)")
plt.legend()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Observations¶

When considering only the first 1,000 observations, the empirical probability appears to converge
toward a lower value. As the sample size increases, the estimate exhibits decreasing fluctuations
and begins to stabilize. Around 10,000 observations, the empirical probability converges closely
to the overall value, illustrating the stabilizing effect predicted by the Law of Large Numbers.
Additionally, the apparent convergence observed at small sample sizes is unstable and sensitive to
random variation, emphasizing that early estimates may be misleading without sufficiently
large samples.

Central Limit Theorem application¶

In [39]:
np.random.seed(42)

# Indicator variable: 1 = severe crash, 0 = non-severe crash
data = severe_crash_event.astype(int).values

# Number of bootstrap samples
num_samples = 1000

# Sample sizes
n_30 = 30
n_1000 = 1000
n_10000 = 10000

# Sampling distributions
sample_means_30 = [
    np.mean(np.random.choice(data, size=n_30, replace=True))
    for _ in range(num_samples)
]

sample_means_1000 = [
    np.mean(np.random.choice(data, size=n_1000, replace=True))
    for _ in range(num_samples)
]

sample_means_10000 = [
    np.mean(np.random.choice(data, size=n_10000, replace=True))
    for _ in range(num_samples)
]

# Plot: Sampling distribution (n = 30)
plt.figure()
plt.hist(sample_means_30, bins=30)
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.title("Sampling Distribution of the Mean (n = 30)")
plt.show()

# Plot: Sampling distribution (n = 1000)
plt.figure()
plt.hist(sample_means_1000, bins=30)
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.title("Sampling Distribution of the Mean (n = 1000)")
plt.show()

# Plot: Sampling distribution (n = 10000)
plt.figure()
plt.hist(sample_means_10000, bins=30)
plt.xlabel("Sample Mean")
plt.ylabel("Frequency")
plt.title("Sampling Distribution of the Mean (n = 10000)")
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Observations¶
  • Each histogram represents the distribution of sample means
  • Not the distribution of individual crashes

Effect of sample size:

  • For 𝑛=30: wider, more irregular distribution, looks more like an exponential function
  • For 𝑛=1000: narrower and more symmetric, but still with big spikes
  • For 𝑛=1000: even more symmetric, a normal shape is distinguishable

Despite the binary nature of the original data, the sampling distribution of the mean approaches a normal shape

Observations and Sanity check¶

Observations: The Law of Large Numbers is supported by the observed convergence of the empirical probability of a
severe crash as the sample size increases. While early estimates fluctuate substantially, the
cumulative mean stabilizes as more observations are included, indicating convergence toward a stable
long-run value.

The Central Limit Theorem is illustrated by the behavior of the sampling distributions of the sample
mean. As sample size increases, the distributions become more concentrated and symmetric, consistent
with the theoretical prediction that sample means approach a normal distribution even when the
underlying data are not normally distributed.

Sanity Check:

  • The observations were randomly shuffled prior to analysis to avoid artifacts caused by ordering
    effects in the dataset.
  • Multiple sample sizes were examined to verify that convergence and distributional changes were not
    specific to a single choice of sample size.
  • A large number of repeated samples were used to ensure that the observed sampling distributions
    were stable.

What would Invalidate These Conclusions:

  • Strong dependence between observations (e.g., repeated crashes involving the same drivers or
    locations) would violate the independence assumption underlying both the Law of Large Numbers and
    the Central Limit Theorem.
  • Severe underreporting or systematic misclassification of crash severity would distort the
    empirical distributions and bias convergence behavior.
  • Insufficient sample sizes would undermine the reliability of the observed convergence and sampling distributions.

E. Regression and Predictive Modeling (45 points)¶


Assigned to Lorenz

  • Define a prediction target and features (justify why they make sense) (10 points)
  • Linear or polynomial model selection (include rationale and show at least two candidates) (10 points)
  • Model fitting and validation (train-test split appropriate for time-series. e.g., time-based split) (15 points)
  • Residual analysis and interpretation (errors, bias, failure cases, what to improve next) (10 points)
In [40]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

create new feature for prediction: injuries_per_day

In [41]:
# Create a new dataframe with daily total injuries

df_daily_data = df_csv_data_cleaned.groupby(df_csv_data_cleaned['crash_date'].dt.date)['injuries_total'].sum().reset_index()

# apply a rolling mean with a window of 7 days
df_daily_data.columns = ['crash_date', 'injuries_total']
df_daily_data['injuries_total'] = df_daily_data['injuries_total'].rolling(window=7, center=True).mean()

# Remove all entries where there is a NaN value
df_daily_data = df_daily_data.dropna()

df_daily_data["crash_date"] = pd.to_datetime(df_daily_data["crash_date"])
df_daily_data["dayofweek"] = df_daily_data["crash_date"].dt.day_name()
df_daily_data["day_of_year"] = df_daily_data["crash_date"].dt.dayofyear
df_daily_data["quarter"] = df_daily_data["crash_date"].dt.quarter

plt.figure(figsize=(14, 6))
plt.plot(df_daily_data['crash_date'], df_daily_data['injuries_total'], linewidth=1.5)
plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Daily Total Injuries Over Time')
plt.tight_layout()
plt.show()
No description has been provided for this image

Train-Test Split

In [42]:
df_daily_data["crash_date"] = pd.to_datetime(df_daily_data["crash_date"])

train = df_daily_data[df_daily_data["crash_date"] < "2024-01-01"]
test = df_daily_data[df_daily_data["crash_date"] >= "2024-01-01"]

# Visualization: Train and test data on a timeline
plt.figure(figsize=(14, 6))

# Plot train data
plt.plot(train["crash_date"], train["injuries_total"], label="Train Data", linewidth=1.5)

# Plot test data
plt.plot(test["crash_date"], test["injuries_total"], label="Test Data", linewidth=1.5)

# Add vertical line at split point
split_date = pd.to_datetime("2024-01-01")
plt.axvline(x=split_date, color="black", linestyle="--", linewidth=2, label="Train-Test Split")

plt.xlabel("Date")
plt.ylabel("Daily Total Injuries")
plt.title("Train-Test Split: Daily Total Injuries Over Time")
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Create Model with Improved Features

In [43]:
from sklearn.preprocessing import LabelEncoder

# Use calendar features with day_of_year for smooth seasonal trends (no month steps)
df_daily_data_clean = df_daily_data.copy()

# Recreate train/test split
train = df_daily_data_clean[df_daily_data_clean["crash_date"] < "2024-01-01"]
test = df_daily_data_clean[df_daily_data_clean["crash_date"] >= "2024-01-01"]

FEATURES = ["dayofweek", "day_of_year", "quarter"]
TARGET = "injuries_total"

# Encode dayofweek to numeric
le = LabelEncoder()
X_train = train[FEATURES].copy()
X_train["dayofweek"] = le.fit_transform(X_train["dayofweek"])

X_test = test[FEATURES].copy()
X_test["dayofweek"] = le.transform(X_test["dayofweek"])

y_train = train[TARGET]
y_test = test[TARGET]

reg = xgb.XGBRegressor(
    n_estimators = 100,
    early_stopping_rounds = 15,
    learning_rate = 0.1,
    max_depth = 3,
    subsample = 0.8,
    colsample_bytree = 0.8,
    reg_lambda = 2.0,
    reg_alpha = 0.5
)

reg.fit(
    X_train, 
    y_train, 
    eval_set=[(X_train, y_train), (X_test, y_test)], 
    verbose=True
)
[0]	validation_0-rmse:5.00436	validation_1-rmse:6.88749
[1]	validation_0-rmse:4.82513	validation_1-rmse:6.71825
[2]	validation_0-rmse:4.67464	validation_1-rmse:6.57536
[3]	validation_0-rmse:4.54441	validation_1-rmse:6.45460
[4]	validation_0-rmse:4.46671	validation_1-rmse:6.37398
[5]	validation_0-rmse:4.37048	validation_1-rmse:6.27648
[6]	validation_0-rmse:4.29130	validation_1-rmse:6.19983
[7]	validation_0-rmse:4.22200	validation_1-rmse:6.11911
[8]	validation_0-rmse:4.16701	validation_1-rmse:6.05709
[9]	validation_0-rmse:4.13661	validation_1-rmse:6.00976
[10]	validation_0-rmse:4.09643	validation_1-rmse:5.95536
[11]	validation_0-rmse:4.06082	validation_1-rmse:5.91021
[12]	validation_0-rmse:4.03126	validation_1-rmse:5.87421
[13]	validation_0-rmse:4.00654	validation_1-rmse:5.84391
[14]	validation_0-rmse:3.98640	validation_1-rmse:5.82141
[15]	validation_0-rmse:3.97628	validation_1-rmse:5.79614
[16]	validation_0-rmse:3.96733	validation_1-rmse:5.77768
[17]	validation_0-rmse:3.94912	validation_1-rmse:5.77308
[18]	validation_0-rmse:3.94325	validation_1-rmse:5.75868
[19]	validation_0-rmse:3.92797	validation_1-rmse:5.75196
[20]	validation_0-rmse:3.92371	validation_1-rmse:5.74371
[21]	validation_0-rmse:3.92101	validation_1-rmse:5.73982
[22]	validation_0-rmse:3.91199	validation_1-rmse:5.73943
[23]	validation_0-rmse:3.90146	validation_1-rmse:5.73079
[24]	validation_0-rmse:3.89267	validation_1-rmse:5.72549
[25]	validation_0-rmse:3.88537	validation_1-rmse:5.70949
[26]	validation_0-rmse:3.88403	validation_1-rmse:5.70996
[27]	validation_0-rmse:3.88296	validation_1-rmse:5.70569
[28]	validation_0-rmse:3.88202	validation_1-rmse:5.70149
[29]	validation_0-rmse:3.87707	validation_1-rmse:5.69936
[30]	validation_0-rmse:3.87110	validation_1-rmse:5.69286
[31]	validation_0-rmse:3.86678	validation_1-rmse:5.69381
[32]	validation_0-rmse:3.86289	validation_1-rmse:5.69364
[33]	validation_0-rmse:3.85837	validation_1-rmse:5.68763
[34]	validation_0-rmse:3.85154	validation_1-rmse:5.68485
[35]	validation_0-rmse:3.84504	validation_1-rmse:5.68268
[36]	validation_0-rmse:3.84478	validation_1-rmse:5.67924
[37]	validation_0-rmse:3.83791	validation_1-rmse:5.67884
[38]	validation_0-rmse:3.83275	validation_1-rmse:5.67718
[39]	validation_0-rmse:3.82964	validation_1-rmse:5.67527
[40]	validation_0-rmse:3.82608	validation_1-rmse:5.67351
[41]	validation_0-rmse:3.82587	validation_1-rmse:5.67298
[42]	validation_0-rmse:3.82370	validation_1-rmse:5.67920
[43]	validation_0-rmse:3.82193	validation_1-rmse:5.67745
[44]	validation_0-rmse:3.81928	validation_1-rmse:5.67008
[45]	validation_0-rmse:3.81697	validation_1-rmse:5.67692
[46]	validation_0-rmse:3.81572	validation_1-rmse:5.68117
[47]	validation_0-rmse:3.81559	validation_1-rmse:5.67576
[48]	validation_0-rmse:3.81339	validation_1-rmse:5.68105
[49]	validation_0-rmse:3.81343	validation_1-rmse:5.67650
[50]	validation_0-rmse:3.81159	validation_1-rmse:5.67249
[51]	validation_0-rmse:3.80964	validation_1-rmse:5.67914
[52]	validation_0-rmse:3.80964	validation_1-rmse:5.67846
[53]	validation_0-rmse:3.80966	validation_1-rmse:5.67952
[54]	validation_0-rmse:3.80955	validation_1-rmse:5.68272
[55]	validation_0-rmse:3.80947	validation_1-rmse:5.68329
[56]	validation_0-rmse:3.80945	validation_1-rmse:5.68478
[57]	validation_0-rmse:3.80943	validation_1-rmse:5.68274
[58]	validation_0-rmse:3.80712	validation_1-rmse:5.68209
[59]	validation_0-rmse:3.80558	validation_1-rmse:5.68440
Out[43]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.8, device=None, early_stopping_rounds=15,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.1, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=3,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=100,
             n_jobs=None, num_parallel_tree=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
objective objective: typing.Union[str, xgboost.sklearn._SklObjWProto, typing.Callable[[typing.Any, typing.Any], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]

Specify the learning task and the corresponding learning objective or a custom
objective function to be used.

For custom objective, see :doc:`/tutorials/custom_metric_obj` and
:ref:`custom-obj-metric` for more information, along with the end note for
function signatures.
'reg:squarederror'
base_score base_score: typing.Union[float, typing.List[float], NoneType]

The initial prediction score of all instances, global bias.
None
booster None
callbacks callbacks: typing.Optional[typing.List[xgboost.callback.TrainingCallback]]

List of callback functions that are applied at end of each iteration.
It is possible to use predefined callbacks by using
:ref:`Callback API `.

.. note::

States in callback are not preserved during training, which means callback
objects can not be reused for multiple training sessions without
reinitialization or deepcopy.

.. code-block:: python

for params in parameters_grid:
# be sure to (re)initialize the callbacks before each run
callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
reg = xgboost.XGBRegressor(**params, callbacks=callbacks)
reg.fit(X, y)
None
colsample_bylevel colsample_bylevel: typing.Optional[float]

Subsample ratio of columns for each level.
None
colsample_bynode colsample_bynode: typing.Optional[float]

Subsample ratio of columns for each split.
None
colsample_bytree colsample_bytree: typing.Optional[float]

Subsample ratio of columns when constructing each tree.
0.8
device device: typing.Optional[str]

.. versionadded:: 2.0.0

Device ordinal, available options are `cpu`, `cuda`, and `gpu`.
None
early_stopping_rounds early_stopping_rounds: typing.Optional[int]

.. versionadded:: 1.6.0

- Activates early stopping. Validation metric needs to improve at least once in
every **early_stopping_rounds** round(s) to continue training. Requires at
least one item in **eval_set** in :py:meth:`fit`.

- If early stopping occurs, the model will have two additional attributes:
:py:attr:`best_score` and :py:attr:`best_iteration`. These are used by the
:py:meth:`predict` and :py:meth:`apply` methods to determine the optimal
number of trees during inference. If users want to access the full model
(including trees built after early stopping), they can specify the
`iteration_range` in these inference methods. In addition, other utilities
like model plotting can also use the entire model.

- If you prefer to discard the trees after `best_iteration`, consider using the
callback function :py:class:`xgboost.callback.EarlyStopping`.

- If there's more than one item in **eval_set**, the last entry will be used for
early stopping. If there's more than one metric in **eval_metric**, the last
metric will be used for early stopping.
15
enable_categorical enable_categorical: bool

See the same parameter of :py:class:`DMatrix` for details.
False
eval_metric eval_metric: typing.Union[str, typing.List[typing.Union[str, typing.Callable]], typing.Callable, NoneType]

.. versionadded:: 1.6.0

Metric used for monitoring the training result and early stopping. It can be a
string or list of strings as names of predefined metric in XGBoost (See
:doc:`/parameter`), one of the metrics in :py:mod:`sklearn.metrics`, or any
other user defined metric that looks like `sklearn.metrics`.

If custom objective is also provided, then custom metric should implement the
corresponding reverse link function.

Unlike the `scoring` parameter commonly used in scikit-learn, when a callable
object is provided, it's assumed to be a cost function and by default XGBoost
will minimize the result during early stopping.

For advanced usage on Early stopping like directly choosing to maximize instead
of minimize, see :py:obj:`xgboost.callback.EarlyStopping`.

See :doc:`/tutorials/custom_metric_obj` and :ref:`custom-obj-metric` for more
information.

.. code-block:: python

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
X, y = load_diabetes(return_X_y=True)
reg = xgb.XGBRegressor(
tree_method="hist",
eval_metric=mean_absolute_error,
)
reg.fit(X, y, eval_set=[(X, y)])
None
feature_types feature_types: typing.Optional[typing.Sequence[str]]

.. versionadded:: 1.7.0

Used for specifying feature types without constructing a dataframe. See
the :py:class:`DMatrix` for details.
None
feature_weights feature_weights: Optional[ArrayLike]

Weight for each feature, defines the probability of each feature being selected
when colsample is being used. All values must be greater than 0, otherwise a
`ValueError` is thrown.
None
gamma gamma: typing.Optional[float]

(min_split_loss) Minimum loss reduction required to make a further partition on
a leaf node of the tree.
None
grow_policy grow_policy: typing.Optional[str]

Tree growing policy.

- depthwise: Favors splitting at nodes closest to the node,
- lossguide: Favors splitting at nodes with highest loss change.
None
importance_type None
interaction_constraints interaction_constraints: typing.Union[str, typing.List[typing.Tuple[str]], NoneType]

Constraints for interaction representing permitted interactions. The
constraints must be specified in the form of a nested list, e.g. ``[[0, 1], [2,
3, 4]]``, where each inner list is a group of indices of features that are
allowed to interact with each other. See :doc:`tutorial
` for more information
None
learning_rate learning_rate: typing.Optional[float]

Boosting learning rate (xgb's "eta")
0.1
max_bin max_bin: typing.Optional[int]

If using histogram-based algorithm, maximum number of bins per feature
None
max_cat_threshold max_cat_threshold: typing.Optional[int]

.. versionadded:: 1.7.0

.. note:: This parameter is experimental

Maximum number of categories considered for each split. Used only by
partition-based splits for preventing over-fitting. Also, `enable_categorical`
needs to be set to have categorical feature support. See :doc:`Categorical Data
` and :ref:`cat-param` for details.
None
max_cat_to_onehot max_cat_to_onehot: Optional[int]

.. versionadded:: 1.6.0

.. note:: This parameter is experimental

A threshold for deciding whether XGBoost should use one-hot encoding based split
for categorical data. When number of categories is lesser than the threshold
then one-hot encoding is chosen, otherwise the categories will be partitioned
into children nodes. Also, `enable_categorical` needs to be set to have
categorical feature support. See :doc:`Categorical Data
` and :ref:`cat-param` for details.
None
max_delta_step max_delta_step: typing.Optional[float]

Maximum delta step we allow each tree's weight estimation to be.
None
max_depth max_depth: typing.Optional[int]

Maximum tree depth for base learners.
3
max_leaves max_leaves: typing.Optional[int]

Maximum number of leaves; 0 indicates no limit.
None
min_child_weight min_child_weight: typing.Optional[float]

Minimum sum of instance weight(hessian) needed in a child.
None
missing missing: float

Value in the data which needs to be present as a missing value. Default to
:py:data:`numpy.nan`.
nan
monotone_constraints monotone_constraints: typing.Union[typing.Dict[str, int], str, NoneType]

Constraint of variable monotonicity. See :doc:`tutorial `
for more information.
None
multi_strategy multi_strategy: typing.Optional[str]

.. versionadded:: 2.0.0

.. note:: This parameter is working-in-progress.

The strategy used for training multi-target models, including multi-target
regression and multi-class classification. See :doc:`/tutorials/multioutput` for
more information.

- ``one_output_per_tree``: One model for each target.
- ``multi_output_tree``: Use multi-target trees.
None
n_estimators n_estimators: typing.Optional[int]

Number of gradient boosted trees. Equivalent to number of boosting
rounds.
100
n_jobs n_jobs: typing.Optional[int]

Number of parallel threads used to run xgboost. When used with other
Scikit-Learn algorithms like grid search, you may choose which algorithm to
parallelize and balance the threads. Creating thread contention will
significantly slow down both algorithms.
None
num_parallel_tree None
random_state random_state: typing.Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType]

Random number seed.

.. note::

Using gblinear booster with shotgun updater is nondeterministic as
it uses Hogwild algorithm.
None
reg_alpha reg_alpha: typing.Optional[float]

L1 regularization term on weights (xgb's alpha).
0.5
reg_lambda reg_lambda: typing.Optional[float]

L2 regularization term on weights (xgb's lambda).
2.0
sampling_method sampling_method: typing.Optional[str]

Sampling method. Used only by the GPU version of ``hist`` tree method.

- ``uniform``: Select random training instances uniformly.
- ``gradient_based``: Select random training instances with higher probability
when the gradient and hessian are larger. (cf. CatBoost)
None
scale_pos_weight scale_pos_weight: typing.Optional[float]

Balancing of positive and negative weights.
None
subsample subsample: typing.Optional[float]

Subsample ratio of the training instance.
0.8
tree_method tree_method: typing.Optional[str]

Specify which tree method to use. Default to auto. If this parameter is set to
default, XGBoost will choose the most conservative option available. It's
recommended to study this option from the parameters document :doc:`tree method
`
None
validate_parameters validate_parameters: typing.Optional[bool]

Give warnings for unknown parameter.
None
verbosity verbosity: typing.Optional[int]

The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
None
In [44]:
# Only create predictions for test data
test_with_pred = test.copy()
test_with_pred["prediction"] = reg.predict(X_test)

# Plot full timeline
plt.figure(figsize=(16, 6))
plt.plot(train["crash_date"], train["injuries_total"], label='Training Data', linewidth=1.5, alpha=0.7)
plt.plot(test["crash_date"], test["injuries_total"], label='Test Data (Actual)', linewidth=1.5, alpha=0.7)
plt.plot(test_with_pred["crash_date"], test_with_pred["prediction"], label='Predictions', linewidth=1.5, linestyle='--', alpha=0.7)

# Add vertical line at train-test split
split_date = pd.to_datetime("2024-01-01")
plt.axvline(x=split_date, color="red", linestyle="--", linewidth=2, label="Train-Test Split")

plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Full Timeline: Original Data vs Predictions')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [45]:
test["prediction"] = reg.predict(X_test)

plt.figure(figsize=(14, 6))
plt.plot(test["crash_date"], test["injuries_total"], label='Original Data', linewidth=1.5)
plt.plot(test["crash_date"], test["prediction"], label='Predictions', linewidth=1.5, linestyle='--')
plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Original Data vs Predictions (2024)')
plt.legend()
plt.tight_layout()
plt.show()

# --- Evaluation metrics (test set) ---
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Prepare arrays
y_true = test["injuries_total"].values
y_pred = test["prediction"].values
No description has been provided for this image
In [46]:
plt.figure(figsize=(14, 6))
plt.scatter(test["crash_date"], test["injuries_total"], label='Original Data', alpha=0.6)
plt.scatter(test["crash_date"], test["prediction"], label='Predictions', alpha=0.6)
plt.xlabel('Date')
plt.ylabel('Daily Total Injuries')
plt.title('Original Data vs Predictions (2024)')
plt.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image
In [47]:
# Metrics
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
# Handle zeros for MAPE
nonzero_mask = y_true != 0
mape = (np.abs((y_true[nonzero_mask] - y_pred[nonzero_mask]) / y_true[nonzero_mask]).mean() * 100) if nonzero_mask.any() else np.nan

print(f"Test RMSE: {rmse:.3f}")
print(f"Test MAE:  {mae:.3f}")
print(f"Test R^2:  {r2:.3f}")
print(f"Test MAPE: {mape:.2f}%\n")

# Residuals distribution
residuals = y_true - y_pred
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Residuals histogram
sns.histplot(residuals, kde=True, bins=30, ax=axes[0])
axes[0].set_title('Residuals Distribution (Test set)')
axes[0].set_xlabel('Residual (Actual - Predicted)')

# Right plot: Predicted vs Actual scatter
axes[1].scatter(y_true, y_pred, alpha=0.6)
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
axes[1].plot(lims, lims, 'k--', linewidth=1)
axes[1].set_xlabel('Actual')
axes[1].set_ylabel('Predicted')
axes[1].set_title('Predicted vs Actual (Test set)')
axes[1].set_aspect('equal', adjustable='box')

plt.tight_layout()
plt.show()
Test RMSE: 5.670
Test MAE:  4.522
Test R^2:  0.129
Test MAPE: 13.52%

No description has been provided for this image

The model was able to roughly predict the values. There are visible plateaus in the predicted data. The reason for that is that not enough features were supplied to the model.

The predicted values are usually also smaller than the actual values. The reason for this is that in the year of 2024 more injuries occured than in the previos years the model was trained on.

Possible improvements are that more features should be used to discribe the fluctuations. Also the input data could be pre-processed better. For example outliers could be removed.


F. Dimensionality Reduction and Statistical Tests (40 points)¶


Assigned to Jeremia

Dimensionality Reduction (25 points)¶

  • PCA projection and interpretation (variance explained, what clusters or separations mean) (10 points)
  • t-SNE embedding with justified hyperparameters (perplexity or similar) and interpretation (7 points)
  • UMAP embedding with justified hyperparameters (neighbors, min dist or similar) and interpretation (8 points)

PCA Projection¶

In [48]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Select numeric columns
num_cols = [
    "injuries_total", "injuries_fatal", "injuries_incapacitating",
    "injuries_non_incapacitating", "injuries_reported_not_evident",
    "injuries_no_indication", "num_units", "crash_hour",
    "crash_day_of_week", "crash_month"
]
X = df_csv_data_cleaned.reset_index()[num_cols].dropna()

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA (all components first to see variance)
pca = PCA()
pca.fit(X_scaled)

# Variance explained
plt.figure()
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")
plt.title("PCA: Variance Explained by Component")
plt.show()

print("Cumulative variance explained:", np.cumsum(pca.explained_variance_ratio_))
No description has been provided for this image
Cumulative variance explained: [0.22040174 0.33902078 0.44478708 0.54853593 0.64912358 0.74913841
 0.84401122 0.93747164 1.         1.        ]
In [49]:
pca2 = PCA(n_components=2)
X_pca = pca2.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3, s=5)
plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]*100:.1f}%)")
plt.title("PCA projection (2D)")
plt.show()
No description has been provided for this image
In [50]:
# Loadings
loadings = pd.DataFrame(
    pca2.components_.T,
    columns=["PC1", "PC2"],
    index=num_cols
)
print(loadings)

# Color by crash_type (align indices)
labels = df_csv_data_cleaned.reset_index().loc[X.index, "crash_type"]
plt.figure(figsize=(8, 6))
for label in labels.unique():
    mask = labels == label
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]*100:.1f}%)")
plt.title("PCA projection colored by crash type")
plt.legend()
plt.show()
                                    PC1       PC2
injuries_total                 0.656960  0.080359
injuries_fatal                 0.088758  0.092885
injuries_incapacitating        0.233514  0.068527
injuries_non_incapacitating    0.508293  0.038661
injuries_reported_not_evident  0.338789  0.044370
injuries_no_indication        -0.341977  0.566568
num_units                      0.119696  0.764198
crash_hour                    -0.030383  0.246612
crash_day_of_week             -0.020527  0.103898
crash_month                    0.015386  0.012061
No description has been provided for this image
In [51]:
# Color PCA projection by first_crash_type (aligned to X index)
labels_first = df_csv_data_cleaned.reset_index().loc[X.index, "most_severe_injury"]

plt.figure(figsize=(8, 6))
for label in labels_first.unique():
    mask_fc = labels_first == label
    plt.scatter(X_pca[mask_fc, 0], X_pca[mask_fc, 1], alpha=0.3, s=5, label=label)

plt.xlabel(f"PC1 ({pca2.explained_variance_ratio_[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({pca2.explained_variance_ratio_[1]*100:.1f}%)")
plt.title("PCA projection colored by most severe injury")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=6)
plt.tight_layout()
plt.show()
No description has been provided for this image

t-SNE embedding¶

In [52]:
from sklearn.manifold import TSNE

# t-SNE Hyperparameter Justification:
# -----------------------------------
# perplexity=30: Balances local vs global structure. Typical range is 5-50.
#   - Lower perplexity focuses on local neighborhoods
#   - Higher perplexity captures more global structure
#   - 30 is a common default that works well for medium-to-large datasets
#
# n_iter=1000: Sufficient iterations for convergence (default)
#
# learning_rate='auto': Automatically set to N/early_exaggeration/4 (recommended)
#
# random_state=42: For reproducibility
#
# Sampling: t-SNE has O(N²) complexity, so we sample 10,000 points
#   for computational tractability while preserving structure

# Sample data for t-SNE (full dataset is too large)
sample_size = 10000
np.random.seed(42)
sample_idx = np.random.choice(len(X_scaled), size=sample_size, replace=False)
X_sample = X_scaled[sample_idx]

# Fit t-SNE with justified hyperparameters
tsne = TSNE(
    n_components=2,
    perplexity=30,       # Balance between local/global structure
    max_iter=1000,       # Standard iterations for convergence
    learning_rate='auto',
    random_state=42
)
X_tsne = tsne.fit_transform(X_sample)

print(f"t-SNE completed. Final KL divergence: {tsne.kl_divergence_:.4f}")

# Plot t-SNE embedding
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.3, s=5)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title(f"t-SNE projection (perplexity=30, n={sample_size})")
plt.show()
Exception in thread Thread-5 (_readerthread):
Traceback (most recent call last):
  File "c:\Users\Jere\miniconda3\envs\NEWenv\Lib\threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "c:\Users\Jere\miniconda3\envs\NEWenv\Lib\threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "c:\Users\Jere\miniconda3\envs\NEWenv\Lib\subprocess.py", line 1599, in _readerthread
    buffer.append(fh.read())
                  ^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 108: invalid start byte
t-SNE completed. Final KL divergence: 0.8709
No description has been provided for this image
In [53]:
# Color t-SNE by crash_type
labels_sample = df_csv_data_cleaned.reset_index().loc[X.index[sample_idx], "crash_day_of_week"].values

plt.figure(figsize=(8, 6))
for label in np.unique(labels_sample):
    mask = labels_sample == label
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("t-SNE projection colored by crash_day_of_week")
plt.legend()
plt.show()

# Color t-SNE by most_severe_injury
labels_injury = df_csv_data_cleaned.reset_index().loc[X.index[sample_idx], "most_severe_injury"].values

plt.figure(figsize=(8, 6))
for label in np.unique(labels_injury):
    mask = labels_injury == label
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("t-SNE projection colored by most severe injury")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=6)
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image
In [54]:
# Compare different perplexity values to show effect on embedding
fig, axes = plt.subplots(1, 5, figsize=(18, 4))

perplexities = [10, 20, 30, 50, 80]

for ax, perp in zip(axes, perplexities):
    tsne_test = TSNE(
        n_components=2,
        perplexity=perp,
        max_iter=1000,
        learning_rate='auto',
        random_state=42
    )
    X_tsne_test = tsne_test.fit_transform(X_sample)
    
    ax.scatter(X_tsne_test[:, 0], X_tsne_test[:, 1], alpha=0.3, s=3)
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    ax.set_title(f"Perplexity = {perp}\nKL div: {tsne_test.kl_divergence_:.3f}")

plt.suptitle("Effect of perplexity on t-SNE embedding", y=1.02)
plt.tight_layout()
plt.show()

# Interpretation:
# - Perplexity 10: Emphasizes very local structure, may fragment clusters
# - Perplexity 30: Balanced view, captures both local and some global structure
# - Perplexity 50: More global view, clusters may merge but overall shape preserved
No description has been provided for this image

UMAP embedding¶

In [55]:
import umap

# UMAP Hyperparameter Justification:
# -----------------------------------
# n_neighbors=15: Controls local vs global structure balance
#   - Lower values (5-10): Focus on very local structure, may fragment data
#   - Higher values (50+): Emphasize global structure, lose local detail
#   - 15 is a common default balancing local manifold structure with global topology
#
# min_dist=0.1: Controls how tightly points cluster
#   - Lower values (0.0-0.1): Allows tighter clustering, better for finding clusters
#   - Higher values (0.5-1.0): Spreads points more evenly, preserves topological structure
#   - 0.1 allows clusters to form while maintaining some separation
#
# metric='euclidean': Standard distance metric for standardized numerical data
#
# random_state=42: For reproducibility
#
# n_epochs=500: Sufficient for convergence on medium datasets
#
# Using same sample as t-SNE for fair comparison

# Fit UMAP with justified hyperparameters
umap_model = umap.UMAP(
    n_neighbors=15,      # Balance local/global structure
    min_dist=0.1,        # Allow tight clustering
    n_components=2,
    metric='euclidean',
    random_state=42,
    n_epochs=500
)
X_umap = umap_model.fit_transform(X_sample)

# Plot UMAP embedding
plt.figure(figsize=(8, 6))
plt.scatter(X_umap[:, 0], X_umap[:, 1], alpha=0.3, s=5)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title(f"UMAP projection (n_neighbors=15, min_dist=0.1, n={sample_size})")
plt.show()
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
No description has been provided for this image
In [56]:
# Color UMAP by crash_type
plt.figure(figsize=(8, 6))
for label in np.unique(labels_sample):
    mask = labels_sample == label
    plt.scatter(X_umap[mask, 0], X_umap[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("UMAP projection colored by crash_day_of_week")
plt.legend()
plt.show()

# Color UMAP by most_severe_injury
plt.figure(figsize=(8, 6))
for label in np.unique(labels_injury):
    mask = labels_injury == label
    plt.scatter(X_umap[mask, 0], X_umap[mask, 1], alpha=0.3, s=5, label=label)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("UMAP projection colored by most severe injury")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=6)
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image
In [57]:
# Compare effect of n_neighbors on UMAP embedding
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

n_neighbors_values = [5, 15, 30, 50]

for ax, n_neigh in zip(axes, n_neighbors_values):
    umap_test = umap.UMAP(
        n_neighbors=n_neigh,
        min_dist=0.1,
        n_components=2,
        random_state=42,
        n_epochs=500
    )
    X_umap_test = umap_test.fit_transform(X_sample)
    
    ax.scatter(X_umap_test[:, 0], X_umap_test[:, 1], alpha=0.3, s=3)
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.set_title(f"n_neighbors = {n_neigh}")

plt.suptitle("Effect of n_neighbors on UMAP embedding (min_dist=0.1)", y=1.02)
plt.tight_layout()
plt.show()

# Interpretation:
# - n_neighbors=5: Very local focus, may over-fragment the data
# - n_neighbors=15: Good balance of local detail and global connectivity
# - n_neighbors=30: More global structure, clusters begin to merge
# - n_neighbors=50: Emphasizes global topology, local details smoothed out
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\spectral.py:548: UserWarning: Spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
No description has been provided for this image
In [58]:
# Compare effect of min_dist on UMAP embedding
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

min_dist_values = [0.0, 0.1, 0.25, 0.5]

for ax, m_dist in zip(axes, min_dist_values):
    umap_test = umap.UMAP(
        n_neighbors=15,
        min_dist=m_dist,
        n_components=2,
        random_state=42,
        n_epochs=500
    )
    X_umap_test = umap_test.fit_transform(X_sample)
    
    ax.scatter(X_umap_test[:, 0], X_umap_test[:, 1], alpha=0.3, s=3)
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.set_title(f"min_dist = {m_dist}")

plt.suptitle("Effect of min_dist on UMAP embedding (n_neighbors=15)", y=1.02)
plt.tight_layout()
plt.show()

# Interpretation:
# - min_dist=0.0: Tightest clustering, points packed densely in cluster cores
# - min_dist=0.1: Slightly relaxed, clusters visible with some internal structure
# - min_dist=0.25: More spread out, easier to see point density variations
# - min_dist=0.5: Uniform spread, preserves more of the topological structure
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
c:\Users\Jere\miniconda3\envs\NEWenv\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
No description has been provided for this image